† Corresponding author. E-mail:
Project supported by the National Natural Science Foundation of China (Grant Nos. 61925402 and 61851402), the Science and Technology Commission of Shanghai Municipality, China (Grant No. 19JC1416600), the National Key Research and Development Program of China (Grant No. 2017YFB0405600), and Shanghai Education Development Foundation and Shanghai Municipal Education Commission Shuguang Program, China (Grant No. 18SG01).
Facing the computing demands of the Internet of things (IoT) and artificial intelligence (AI), the cost of moving data between the central processing unit (CPU) and memory has become the key problem, and chips featuring flexible structural units, ultra-low power consumption, and massive parallelism will be needed. In-memory computing, a non-von Neumann architecture fusing memory units and computing units, can eliminate the data transfer time and energy consumption while performing massively parallel computations. Prototype in-memory computing schemes adapted from different memory technologies have shown orders-of-magnitude improvement in computing efficiency, leading many to regard it as the ultimate computing paradigm. Here we review the state-of-the-art memory device technologies with potential for in-memory computing, summarize their versatile applications in neural networks, stochastic generation, and hybrid-precision digital computing, together with promising solutions for unprecedented computing tasks, and also discuss the stability and integration challenges facing general in-memory computing.
Under the wave of artificial intelligence (AI) and 5G communication, how to process massive amounts of data more efficiently is the fundamental problem of information technology.[1,2] According to statistics from the International Data Corporation (IDC), the total amount of global data is expected to reach 100 ZB by 2023.[3] As this vast amount of information will be processed in computing units and recorded in memory units, future requirements for both computing and memory will grow substantially. However, the strict separation of computation and storage in the modern computing system has been an inherent disadvantage since its invention.[4] The speed gap between computation and memory keeps widening, and a large amount of energy and time is wasted transporting data, the so-called ‘memory wall’[5] arising from the von Neumann bottleneck. Besides, as the feature size of silicon-based integrated circuits approaches its physical limit, the performance gains from size reduction are diminishing, and the heat dissipation caused by leakage current is increasingly non-negligible.[6] For instance, Google’s AI recognition network was trained on a cluster of 16000 processor cores for three days while consuming 100 kW of power.[7] Therefore, how to further reduce power consumption and improve performance in future integrated circuits (ICs) is a central concern of researchers. A common approach today is to use graphics processing units (GPUs)[8,9] or accelerators[10,11] to improve parallel computing, along with increasing the bandwidth of the memory,[12,13] but such approaches bring only limited improvements in speed and energy consumption and do not fundamentally solve the bottleneck.[14]
In-memory computing, namely, computing at the site where data are stored, is considered one of the ultimate solutions.[15–19] This new computing architecture incurs no data-movement cost and is expected to completely break the limitations of the memory wall through high-throughput in situ data processing. Integrating the computational and storage functions of a chip in one unit was already proposed in 1969,[20] but, enjoying the dividends of Moore’s law and the convenience of designing memory and processors separately, people paid little attention to architectures beyond the von Neumann structure in those days. Only recently, after in-memory logic operations[21,22] and matrix-vector multiplication (MVM)[23–26] demonstrated potential improvements in power/time efficiency, have researchers begun to explore different schemes to enable general in-memory computing for the future.
Various emerging memory device technologies have been brought into the computing hierarchy and have shown orders-of-magnitude improvement in computing efficiency. Resistive switch devices such as resistive random access memory (RRAM),[27,28] phase change memory (PCM),[29,30] and magnetic tunnel junctions (MTJs)[31,32] share a similar, simple three-layer structure and all rely on physical resistance to represent the storage state. It is only necessary to apply a voltage across the terminals to change the characteristics of the material, and after the voltage is removed the state remains unchanged inherently, which lays the foundation for the realization of in-memory computing and provides a feasible route to greatly improving the efficiency of MVM. New uses of charge-based devices such as flash,[33] SRAM,[34] and FeFET[35,36] also provide fresh ideas for implementing in-memory computing. These three-terminal charge-based field-effect transistors (FETs) build on mature silicon manufacturing techniques, so they are closer to commercial availability. Meanwhile, adding in situ memory to logic, that is, placing memory and computation as physically close as possible, can also realize in-memory computing, reduce the data transportation cost by storing and processing captured data in situ, and produce ‘highly processed’ information.[37,38]
In terms of applications, diversified memory technologies can perform specific functions, according to their own characteristics, for various digital or analog tasks. On the analog side, many researchers have realized simple neural network calculations by constructing crossbar arrays,[39,40] achieving letter recognition,[41] image classification,[42] sparse coding,[43] etc. Analog neural network computing also enables new computational paradigms like neuromorphic computing,[44–46] which aims to perform brain-like synaptic functions to mimic the human brain and is one of the long-term goals of chip manufacturing. Randomness is another potential advantage of in-memory computing for stochastic applications.[47–49] As for hybrid-precision applications, in-memory binary computing can be easily achieved by resistive switching with lower energy and area consumption,[50,51] while a larger number of states further allows accumulative computing applications.[52,53] Although some of these studies are still at a preliminary stage, the advantages and characteristics of in-memory computing in vector-matrix multiplication and brain-like computation are already apparent. We should also be aware that current in-memory computing technology still has many shortcomings,[18,19] such as device stability, large-scale integration, and the lack of more applicable and mature algorithms. To implement high-performance computing for the benefit of human beings, it is necessary to overcome these difficulties and challenges.
This review focuses on the current situation of cutting-edge research on in-memory computing technologies, explores the problems encountered in large-scale fabrication and applications, and evaluates the possible solutions. In Section
The implementation principle of in-memory computing is usually determined by the underlying basic unit. In general, in-memory computing needs a memory portion to store information, so it shares certain commonalities with many other memory technologies.[54,55] So far, most in-memory computing methods have been developed by modifying mature memory devices to add computational functions. The devices that can perform in-memory computing can be divided into three types (Fig.
The emerging non-volatile memories mainly depend on physical state changes of specific materials to represent information, differing from traditional electronic devices that control charges. The information they store can be reflected by resistance, phase, magnetism, etc. The external behavior of these devices, namely the current response, differs under the action of an electric field but can be attributed to a change in resistance state, so they are also collectively referred to as resistive switch devices.[16]
RRAM, also known as the memristor in other literature, was first reported as early as the 1960s as a reversible resistive effect induced by electric pulses.[61,62] However, due to the limitations of semiconductor technology and market demand at that time, it did not receive widespread attention. In 1971, Chua proposed a theoretical model of the memristor,[63] predicting the existence of a device whose resistance state changes with the history of the applied voltage. In 2008, HP prepared a TiO2-based RRAM device in experiments[59] and first connected RRAM with the memristor. RRAM usually adopts a metal–dielectric–metal sandwich structure, as shown in Fig.
First proposed by Stanford Ovshinsky in the 1960s,[72] the phase change materials that make up PCM are mainly chalcogenides (such as GeSbTe[73] and GeTe[74]), which can be stabilized in a polycrystalline or amorphous state. The polycrystalline phase is long-range ordered with low resistivity while the amorphous phase is short-range ordered with high resistivity, by which different data can be stored. The structure of PCM is shown in the inset of Fig.
MRAM is used for information storage through a magnetic tunnel junction (MTJ).[88] The unit structure is shown in the inset of Fig.
Traditional flash memory is a considerably mature and highly commercialized non-volatile memory that is available everywhere. Flash memory is usually divided into two categories: one is the NAND structure, which can achieve ultra-high-density storage but has slower read/write speed; the other is the NOR structure, in which each bit unit can be operated independently and read/write is faster, but the cell area is relatively large. The basic unit of flash is the floating-gate transistor, whose basic structure is shown in the inset of Fig.
FeFET is a combination of a traditional transistor and a ferroelectric material.[106] Due to the spontaneous polarization of ferroelectric materials, the atoms in the crystal can move over the potential barrier after a suitable external field is applied (required to be larger than the coercive field Ec), leading to the inversion of the intrinsic polarization. This characteristic, that the spontaneous polarization can change direction with the external electric field, is called ferroelectricity. When the electric field is removed, the polarization state is maintained, so the two stable positive/negative polarization states can exactly correspond to the values of ‘1’/‘0’ in binary logic, forming a binary switch.[107] The two stable states can be repeatedly reversed by the electric field, resulting in a hysteresis curve of polarization versus electric field, as shown in Fig.
The SRAM memory cell has a variety of structures, such as 4T, 6T, 8T, and 12T; among them, the 6T structure achieves the best overall performance, so the cell generally adopts the 6T structure.[113] As shown in the inset of Fig.
On the road of device miniaturization, researchers have found nanoscale materials with new characteristics, such as carbon nanotubes[118] and two-dimensional materials.[119] These nanomaterials can not only make transistors smaller in size and continue Moore’s law, but also improve energy efficiency and speed by orders of magnitude. In addition to the performance improvements, these new materials can also be vertically stacked, using their structural advantages, for in situ storage above the cell. In Fig.
Owing to their extremely thin layered structure, two-dimensional materials can also realize new types of logic and memory (Fig.
Benefiting from the reduction of operations and of the access cost between storage and calculation, in-memory computing frees up more possibilities to fulfill unprecedented computing paradigms for versatile applications. The natural fusion of memory and computing makes memory devices behave more like biological neurons, suitable for intelligent applications by constructing neural networks. Also, the intrinsic randomness inside memory devices provides a strong guarantee for stochastic generation, the cornerstone of stochastic computing beyond conventional serial computing. Moreover, in-memory computing also possesses the capability of hybrid-precision digital computing to ensure sufficient accuracy where the calculation requires it.
As the core technology of the information society, IC chips can be divided into digital chips, memory chips, and analog chips according to their function and market share. Nowadays most computing chips adopt digital logic operations. Thanks to the development of silicon-based technology, CMOS logic chips can use simple binary signals to complete complex data processing, perform arithmetic operations expediently with digital logic, and occupy a large share of the computing market due to their high integration density and small size. Nevertheless, with the maturity of new device technologies such as RRAM and PCM, long-neglected analog neural computing has also become feasible for large-scale applications. Because analog circuits process continuous signals, their anti-interference ability and calculation accuracy are not as good as digital computing, but for specific algorithms, memory-based analog computing can achieve higher efficiency with neural network architectures.
Using Ohm’s law for multiplication and Kirchhoff’s current law for summation, vector-matrix multiplication, the foundation of neural network computing, can be easily mapped onto the crossbar array, achieving in-memory analog computation.[18] In 2009, Xia proposed that the switching characteristics of RRAM could also be used as a transfer switch to realize reconfigurable logic.[124] Meanwhile, he proposed that the memristor crossbar could serve as brain-like synapses and be combined with neurons made of silicon-based transistor circuits to realize brain-like computing, bringing a new application direction to RRAM. In 2010, Lu first realized the concept of using nanoscale memristors as biomimetic synapses,[125] and for the first time implemented the STDP learning mechanism on memristors, drawing great attention to the construction of artificial neural networks with emerging non-volatile memory devices. The essence of the artificial neural network is the parallelization of memory and computing. The advent of in-memory computing in the memristor crossbar makes the neural network available through analog computing, allowing brain-like computing to take a step further. The basic units of artificial networks are synapses and neurons.[1] Neurons, connected by synapses with weights, can be divided into multiple layers, as shown in Fig.
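The mapping described above can be sketched numerically: each conductance G[i][j] performs one multiplication by Ohm’s law, and Kirchhoff’s current law sums the per-device currents along a line, so the array delivers a full matrix-vector product in one read step. The conductance and voltage values below are illustrative, not taken from any reported device.

```python
# Illustrative 2x3 crossbar: G[i][j] is the conductance (S) of the device
# at row i, column j; each device stores one weight of the matrix.
G = [[1.0e-6, 2.0e-6, 0.5e-6],
     [3.0e-6, 1.0e-6, 2.0e-6]]

V = [0.2, 0.1, 0.3]   # input voltages (V) applied along one dimension

# Ohm's law: each device passes I_ij = G_ij * V_j; Kirchhoff's current law
# sums those currents on line i, i.e. I_i = sum_j G_ij * V_j -- the whole
# matrix-vector multiplication happens in parallel in the analog domain.
I = [sum(g * v for g, v in zip(row, V)) for row in G]
print(I)   # output currents (A), one per row
```

In a real array the output currents would then be digitized by peripheral sense amplifiers; here they are simply printed.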
Compared with the high energy consumption of the traditional CMOS circuit, the resistive switch device, regarded as an electronic synapse, stores the weight and transmits signals at the same time, so it is more conducive to analog in-memory computing with ultra-low power consumption. The two-terminal resistive switch easily participates in data processing while accomplishing data storage, providing high data throughput that is of great significance for various AI applications. Note that although conventional silicon-based computing components will still be required for the resistive switch device’s potential commercialization in the near future, resistive switch devices can be fabricated compatibly over the existing CMOS substrate thanks to their low thermal budget. A 54 × 108 memristor crossbar was integrated on top of CMOS circuits including all the necessary interface circuitry, digital buses, and an OpenRISC processor (Fig.
A more biologically inspired approach, spiking neural networks (SNNs),[18] can rigorously mimic the mechanism of brain information processing. They have been implemented in CMOS hardware such as IBM’s TrueNorth[126] and Intel’s Loihi.[127] However, CMOS devices are ultimately unable to achieve the fusion of memory and computing, bringing a large waste of resources. It is important to find artificial synaptic devices with analog synaptic function; simple two-terminal memristors can accomplish similar functions with obvious reductions in area, complexity, and power consumption. Artificial neurons with leaky integrate-and-fire (LIF) behavior have been demonstrated with a single memristor device.[45] Furthermore, the lattice polarization dynamics of the ferroelectric layer in FeFET can imitate the learning rules of SNNs, like spike-timing-dependent plasticity (STDP).[35] Novel low-dimensional materials also pave the way for SNN synaptic devices,[128] such as CNT synapses applied for unsupervised learning with an STDP scheme.[129] Figure
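The LIF behavior mentioned above can be sketched in a few lines. The leak factor, threshold, and reset value here are illustrative parameters, not measured from any memristor neuron.

```python
def lif_neuron(input_currents, v_th=1.0, leak=0.9, v_reset=0.0):
    """Leaky integrate-and-fire sketch: leak the membrane potential each
    time step, integrate the input, and emit a spike (1) on crossing the
    threshold, after which the potential is reset."""
    v, spikes = 0.0, []
    for i in input_currents:
        v = leak * v + i        # leaky integration
        if v >= v_th:
            spikes.append(1)    # fire...
            v = v_reset         # ...and reset
        else:
            spikes.append(0)
    return spikes

# A constant sub-threshold input still fires once enough charge accumulates:
spikes = lif_neuron([0.5] * 6)
```

In a memristor implementation, the role of v is played by an internal state variable (e.g. filament temperature or ion distribution) rather than an explicit software counter.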
Since the conduction mechanism depends on changes of microscopic components, resistive switch devices such as PCM and RRAM inevitably exhibit certain variations, which are usually regarded as disadvantages. In IC manufacturing, we strive to reduce variation to a minimum. However, the random differences inherent in such solid-state devices offer great advantages in realizing the identification of specific objects. This randomness is closely analogous to the principle of bio-identification and, implemented in hardware, is called a physical unclonable function (PUF). An RRAM-based PUF, for example, exploits not only the variability of integrated circuit manufacturing but also the randomness inherent in the conduction mechanism of RRAM.[48] This inherent randomness originates from the oxygen vacancies inside the insulating dielectric between the top and bottom electrodes of the RRAM. When a conductive filament is formed and ruptured, the nanoscale gaps between the oxygen vacancies inevitably vary from device to device, and even between different switching cycles of the same RRAM cell (cycle to cycle), and these variations can be read out through the RRAM resistance or current. Therefore, compared with other technologies, RRAM offers more sources of randomness. The first reconfigurable PUF chip based on RRAM has been designed (Figs.
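The idea can be illustrated with a toy model: the uncontrollable filament-gap variation frozen in at fabrication is stood in for by a seeded random resistance per cell, and each response bit comes from comparing a pair of cells. The pairing scheme and resistance distribution are assumptions for illustration only, not the scheme of the cited chip.

```python
import random

def puf_response(n_bits, fab_seed):
    """Toy RRAM-PUF sketch: fab_seed stands in for the random filament-gap
    variation fixed at manufacture; pairing cells and comparing their
    resistances yields a reproducible, device-unique bit string."""
    rng = random.Random(fab_seed)
    # cell resistances (ohms), randomly spread around a nominal 100 kOhm
    r = [rng.gauss(100e3, 20e3) for _ in range(2 * n_bits)]
    return [int(r[2 * i] > r[2 * i + 1]) for i in range(n_bits)]

# The same 'device' always answers identically; two devices disagree:
chip_a = puf_response(64, fab_seed=1)
chip_b = puf_response(64, fab_seed=2)
```

The key property mirrored here is that the response is stable for one device yet practically impossible to predict or clone for another.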
Although variability is a fatal flaw for traditional computing, it is useful in stochastic computing, where it can be used to generate random numbers. Random number generation (RNG) is significantly important for stochastic computing,[132] data encryption, and neuromorphic computing.[133] Using the randomness inherent in RRAM, a true random number generator can be built, which is more stable and reliable than pseudo-random number generators that need to be seeded. Different experiments[134,135] have shown how to generate random numbers with memristors, one of which utilizes the stochastic delay time, exploiting switching variability to improve the quality of the generated random numbers (Fig.
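A delay-time TRNG of this kind can be sketched as follows. The stochastic SET delay is modeled as an exponential random variable, which is an assumption for illustration rather than device data, and a standard von Neumann debiasing step is added to remove residual bias from the raw bit stream.

```python
import math
import random

def raw_bits(n, rng):
    # Model the stochastic switching delay as Exp(1) (an assumption); the
    # median of Exp(1) is ln 2, so comparing each observed delay to the
    # median yields one nominally unbiased raw bit per switching cycle.
    return [int(rng.expovariate(1.0) > math.log(2)) for _ in range(n)]

def von_neumann_debias(bits):
    # Read non-overlapping pairs: 01 -> 0, 10 -> 1, discard 00 and 11.
    # Residual bias is removed at the cost of lower throughput.
    return [a for a, b in zip(bits[::2], bits[1::2]) if a != b]

rng = random.Random(42)   # seeded only to make this sketch repeatable
out = von_neumann_debias(raw_bits(1000, rng))
```

A hardware realization would obtain the delays from repeated SET attempts on a real cell instead of a software random source.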
Digital computing is the main function of traditional silicon-based IC chips, and its ability to solve complex computing tasks through basic digital logic has brought great convenience to human life. Compared with analog computation, published literature on array-level experimental demonstrations of in-memory digital computing is scarce. In order to break through the existing von Neumann bottleneck while ensuring calculation accuracy, it is necessary to realize digital logic calculation for general-purpose in-memory computing. According to the number of states stored in the storage unit, digital calculation can be divided into binary logic calculation based on switching characteristics and multi-bit calculation based on multi-state storage. From binary to multi-bit, hybrid-precision computing can be achieved in in-memory digital computing.
Since the invention of the transistor, switching transistors have held a great advantage in binary computing through continual shrinking of the feature size, becoming the most fundamental information processing technology today. To achieve digital calculations with low energy consumption and high area efficiency, new digital computing concepts such as quantum dots[142] and even single atoms[143] have been proposed one after another, but these experimental demonstrations have not yet achieved good control of voltage and current on individual units. Instead, devices made from new materials such as carbon nanotubes[118] and two-dimensional materials[119] show promising performance in these respects. Profiting from their nearly defect-free lattice surfaces, both carbon nanotube and two-dimensional material transistors can be heterogeneously integrated at the nanoscale. Therefore, a memory cell can conveniently be stacked on a logic transistor formed from such new materials to accomplish in situ memory. A three-dimensional integrated in situ memory chip vertically stacks bitwise RRAM cells on top of carbon nanotube field-effect transistors (CNFETs), so the half-adders constructed from the CNFETs can access the stored values immediately to complete conventional binary computing (Fig.
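For reference, the half-adder realized by those CNFET circuits is just the familiar pair of Boolean functions; the code below is the abstract logic, not a model of the CNFET hardware.

```python
def half_adder(a, b):
    """Half-adder: sum is XOR of the inputs, carry is AND. With RRAM
    stacked directly above the logic, both outputs can be latched in situ
    instead of being shipped to a separate memory."""
    return a ^ b, a & b   # (sum, carry)

# Full truth table, one row per input combination:
table = [(a, b, *half_adder(a, b)) for a in (0, 1) for b in (0, 1)]
```

Chaining two half-adders with an OR of the carries gives a full adder, the building block of multi-bit arithmetic.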
Resistive switch devices such as RRAM and PCM show many advantages for implementing binary calculations. With their simple structure, crossbar integration is easy to achieve and the devices are directly reconfigurable under the control of current and voltage. The high and low conductances can be used to denote the logic variables ‘1’ and ‘0’, respectively. More importantly, they are tantalizing for their amenability to miniaturization and their non-volatility. Taking RRAM as an example, non-volatile binary logic can be achieved as ‘stateful’ logic.[50] If the resistance also serves as the input variable, so that input and output share the same physical variable, the logic operation is called stateful logic. A basic stateful logic cell usually requires multiple devices connected in a certain way; the voltage division between the devices determines the final output. There are two main stateful logic methods (IMP[50,144] and MAGIC[145,146]), and we take IMP as an example. The basic IMP gate is composed of two identical memristors P and Q and a voltage-dividing resistor RG, as shown in Fig.
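The logical behavior of the IMP gate can be checked abstractly: after one operation, memristor Q ends up holding q' = (NOT p) OR q, while P is unchanged. IMP together with FALSE (unconditionally clearing a cell to 0) is functionally complete, so, for example, NAND follows in two IMP steps. The sketch below is pure truth-table code, not a circuit model.

```python
def imp(p, q):
    # Material implication: the new state of Q is (NOT p) OR q;
    # P keeps its state and acts only as the conditioning input.
    return int((not p) or q)

def nand(p, q):
    """NAND built from IMP plus the FALSE (clear-to-0) operation."""
    s = 0                 # auxiliary cell cleared by FALSE
    s = imp(p, s)         # step 1: s = NOT p
    return imp(q, s)      # step 2: (NOT q) OR (NOT p) = NAND(p, q)
```

Because NAND is universal, any Boolean function can in principle be composed this way, at the cost of the multi-step sequencing noted below.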
In addition to memristors, other resistive switch devices, such as PCM[82] and STT-MRAM,[97] can also achieve similar non-volatile logic operations, which shows the universality of non-volatile logic storage and calculation. Compared with conventional CMOS logic, non-volatile logic uses the same circuit structure to implement reconfigurable logic by applying different voltages: it is not only non-volatile but also flexibly reconfigurable. However, the number of required steps increases with functional complexity, sacrificing some computing efficiency and increasing power consumption. On the other hand, traditional silicon-based in-memory computing devices, such as flash, FeFET, and SRAM, can complete Boolean logic and save the result in one operation based on conventional CMOS logic,[108,114] though they still sacrifice some area and power, which is acceptable relative to the energy cost of data transport. Moreover, these novel binary computation technologies are all at a preliminary research stage with many possibilities for improvement. In situ storage realized with carbon nanotubes and two-dimensional materials can also integrate the memory unit on the computing cell through 3D stacking without changing the logic operation of conventional transistors; while maintaining the advantages of traditional binary logic, it can also perform high-speed parallel storage, even though the related research is still at the early proof-of-concept stage.
The optimization and advancement of memory technology help to realize the storage of more states than 0 and 1, making multi-state computation possible. For devices with a conventional memory function, the binary representation is naturally easy to implement because there are only two states, 0 and 1; indeed, part of the reason for using binary is that the states of the memory are limited. A higher degree of computational freedom puts higher requirements on the memory device, meaning the device unit should store more states, in contrast to binary switching characteristics. At present, device technologies that can implement multi-state storage include RRAM,[45] PCM,[52] flash,[147] and so on. Usually, a suitable electrical signal such as an accumulated pulse train is applied to these memory devices, and the resistance of the device evolves accordingly. By reading out the resistance value at the appropriate time, the calculation corresponding to the input electrical signals is completed. For example, the resistance transformation of PCM is determined by the gradual crystallization of its phase-change layer from the amorphous state. Under continuous voltage-pulse stimulation, more crystallization yields lower resistance, so PCM has phase-change accumulation characteristics. As shown in Fig.
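The accumulation behavior can be sketched as a counter-like cell: every SET pulse steps the crystallization, and hence the conductance, one level, so the read-out level equals the number of pulses received. The number of distinguishable levels below is an illustrative assumption; real cells also show drift and nonlinearity that this sketch ignores.

```python
class AccumulativeCell:
    """Idealized PCM-like cell: each pulse crystallizes the phase-change
    layer a little more, raising the conductance one level until the fully
    crystalline state saturates."""
    def __init__(self, levels=8):
        self.level = 0          # fully amorphous, highest resistance
        self.levels = levels

    def pulse(self, n=1):
        # each SET pulse nudges crystallization one level toward saturation
        self.level = min(self.level + n, self.levels - 1)

    def read(self):
        return self.level

# In-memory addition: feed both operands into the same cell as pulse trains.
cell = AccumulativeCell()
cell.pulse(3)
cell.pulse(4)
result = cell.read()
```

The sum is never moved to a separate ALU; it accumulates in place and is recovered by a single resistance read.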
Other fascinating applications of in-memory computing that take advantage of the accumulative behavior are prime number decomposition[53] and temporal correlation detection.[148,149] Finding a factor M of a specific number N is similar to base-10 addition. First, a resistance threshold is specified that is reached after applying M pulses. Second, a total of N pulses is applied to the device sequentially, and the device is RESET every time the resistance reaches the threshold. In the end, if M is a factor of N, the final state of the device will be exactly at the threshold; otherwise it will not. Multiple configured PCM devices can be fed the same N pulses in parallel to test different candidate factors M simultaneously (Fig.
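The factor test described above reduces to a threshold-and-reset counter. Idealizing the device as such a counter (ignoring drift and variability), the scheme can be sketched as:

```python
def is_factor(m, n):
    """Apply n pulses to a cell whose resistance threshold is reached every
    m pulses, RESETting at each crossing; m divides n iff the cell ends in
    the just-reset (threshold) state."""
    state = 0
    for _ in range(n):          # feed the n pulses sequentially
        state += 1
        if state == m:          # resistance threshold reached
            state = 0           # RESET the device
    return state == 0

# Several cells configured with different m consume the same n pulses in
# parallel, testing all candidate factors simultaneously:
factors = [m for m in range(2, 12) if is_factor(m, 12)]
```

The parallelism comes entirely from the array: one pulse train drives every candidate divisor at once.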
The feasibility of in-memory computing cannot be separated from good device performance, large-scale integrated arrays, and applicable system-level algorithms (Fig.
Unlike simple binary memory, non-volatile computing places higher requirements on the performance stability of individual devices, because forcing computation onto memory demands more of device uniformity and cycling endurance. The most important issue in device research is uniformity, from device to device and cycle to cycle. Regardless of two-terminal or three-terminal devices, the complexity and maturity of manufacture greatly affect the uniformity between batches and even within the same batch. Therefore, technologies based on the traditional silicon CMOS process have a great manufacturing advantage, while the newer device technologies still need more mature fabrication processes. Differences in device mechanisms also greatly affect uniformity. For resistive switches, changes in the material properties of the resistive layer determine the transition of the device state, which is stochastic. This may be acceptable at sufficiently large sizes, but when the device is small enough, the randomness of ion movement or the thermal activation of defects can cause fluctuations in device parameters, thereby reducing uniformity. For digital computation, the accuracy of the results is paramount, so random fluctuations reduce computational reliability. For analog computation, the disturbance caused by microscopic randomness can help avoid falling into local optima, so analog computing has a certain tolerance for variation. In particular, stochastic computing takes good advantage of such parameter fluctuations to achieve truly random generation, where it is necessary to ensure that the variations are sufficiently stochastic. Of course, the parameters of each device still need to stay within a reasonable range.
The optimization of uniformity is also a hot research topic worldwide, especially for memristors, through approaches such as replacing conductive-filament-type memristors with interface-type memristors,[150] increasing the switching ratio to compensate for resistance fluctuation, and introducing dislocation defects or local doping to confine the location and shape of the conductive filament,[151,152] etc.
Due to the high frequency of calculation, the devices need to undergo a large number of repeated operations, which is a big test of device stability. The current response during the switching process (that is, the write procedure) inevitably produces Joule heat, limiting the lifetime of the devices. The lifetime varies across device technologies. MRAM requires only a small current to operate, so it can endure up to 10^14 cycles.[94] The endurance of the memristor depends largely on the operating current. Although both are based on the traditional silicon MOS process, the cycling endurance of flash and SRAM is quite different: flash requires high power to erase and write, making its lifetime almost the shortest, while SRAM can theoretically perform unlimited repeated operations. To maximize the potential of a device, the selection of its materials and the optimization of its structure are critical. For example, for memristors, selecting materials such as tantalum oxide and hafnium oxide, from resistive material systems with only two stable chemical phases, can effectively improve cycling endurance.[28,153] Moreover, a suitable operating voltage,[154] electrode material selection,[155] and algorithm design optimization[156] also help improve endurance. Decent retention is a basic capability of these memory technologies, but at the nanoscale the conductance state tends to drift with time, temperature, and voltage bias, which is prone to cause computing inaccuracy. The influence of such drift can be alleviated to a certain extent by selecting devices with a large Ion/Ioff ratio.
Apart from variability and endurance, factors affecting the device include read/write speed, power consumption, number of states, symmetry, and linearity. Different device technologies have their own strengths and weaknesses in these aspects (Table
Traditional silicon-based CMOS technology has been the mainstream technology for the past 50 years due to its advantages in miniaturization and integration. Large-scale integration is the only way for in-memory computing to go beyond the laboratory and into real-life applications. In-memory computing technologies based on flash, SRAM, and FeFET are fully compatible with the current CMOS process, allowing the corresponding technologies to continue on the current Moore’s-law path. Although there is still great debate about the slowing or even the end of Moore’s law, in-memory computing can greatly reduce the transmission power consumption behind the memory wall, bringing a certain increase in efficiency to integrated chips. So exploiting the advantage of silicon-based technology is a feasible route to large-scale in-memory computing. Using the most advanced 7-nm technology, an SRAM macro for machine learning has been built with powerful computation. In addition to continued scaling to improve performance, leading silicon-based techniques can also be transferred to the compatible processes of other memory devices, boosting their productivity. For example, extreme ultraviolet lithography is helpful for mass production of nanoscale crossbar arrays, while physical vapor deposition (PVD) and atomic layer deposition (ALD) techniques can produce switching layers, dielectric layers, and metal layers with excellent conformity, accurate thickness uniformity, and composition control.
Because of their simple device structure, two-terminal resistive devices can easily be made into crossbar arrays, which have already been fabricated to realize certain functions. Some of them are stand-alone array chips that work with external processing chips,[24] while others are manufactured by fusion with CMOS chips.[23] Arrays at the current scale can only deal with simple tasks. To handle more complex tasks, the array size needs to be enlarged, but several problems arise at this step. Although the crossbar array has the advantage of a simple manufacturing process, when reading the resistance of a device the existence of sneak paths introduces parallel current routes, which may cause incorrect read-out and bring additional power consumption. Meanwhile, the crosstalk caused by highly parallel writing of the array will also disturb the resistance of unselected devices to a certain extent.[164] This is especially serious for digital logic computation: since it requires accurate reading of each individual cell, leakage current greatly limits the logic-operation function. When the crossbar network is used for inference in analog computing, all nodes are opened, so the leakage-current problem does not arise; during training, however, when the weight of each node must be precisely updated, such disturbance still needs to be avoided as much as possible. The leakage problem becomes more serious as the array grows, limiting the expansion of the array and thus restricting the increase in functional complexity. Current solutions include applying a protection voltage during array operation, making the memristor itself non-linear through device optimization, and adding a selector or transistor in series with the memristor cell to form 1D1R, 1S1R, and 1T1R structures.[165–167] But such cascaded structures enlarge the unit area, especially the 1T1R structure.
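How badly sneak paths corrupt a read can be estimated with a simplified equivalent-resistance model (our own illustration, not from the review). In the worst case the selected cell is in its high-resistance state while every other cell is in the low-resistance state, and the sneak network is approximated by three stages of parallel low-R cells: from the selected row to the unselected columns, across the inner block, and back into the selected column. All resistance values and the array size are assumptions.

```python
# Worst-case sneak-path estimate for a passive n x n crossbar (toy model).

def parallel(r1, r2):
    """Resistance of r1 and r2 in parallel."""
    return r1 * r2 / (r1 + r2)

def measured_resistance(r_sel, r_lrs, n):
    """Apparent resistance of the selected cell including the sneak network."""
    r_sneak = (r_lrs / (n - 1)          # (n-1) LRS cells off the selected row
               + r_lrs / (n - 1) ** 2   # (n-1)^2 LRS cells in the inner block
               + r_lrs / (n - 1))       # (n-1) LRS cells into the selected column
    return parallel(r_sel, r_sneak)

# In an assumed 64 x 64 array, a 1 MOhm selected cell reads as only a few
# kiloohms because the sneak network dominates -- the motivation for adding
# selector elements (1D1R, 1S1R, 1T1R) in series with each cell.
r_read = measured_resistance(r_sel=1e6, r_lrs=1e5, n=64)
```

The apparent resistance falls orders of magnitude below the true cell value, and the error worsens as n grows, matching the text's point that leakage limits array expansion.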
Resistive-switching devices can be made with very small feature sizes, which helps reduce the area required to expand the array, but this also introduces the problems of line resistance and capacitance. Since the interconnect metal wire is not an ideal conductor, parasitic resistance and capacitance bring RC delay and resistive voltage division along the line. On the one hand, this increases circuit delay; at the same time, the uneven voltage distribution may leave the devices farther from the supply without sufficient voltage to work properly, reducing operational reliability.[168] To alleviate this problem, one can increase the width and thickness of the metal interconnect,[169] add access points for the power supply and ground wires,[170] exploit novel materials such as carbon nanotubes and graphene,[171] etc., all of which of course cost area. Another option is to divide the array into many small arrays, connected laterally in two dimensions or stacked in three dimensions, so that the effect of the resistance and capacitance within each small array is suppressed. In the face of these challenges of large-scale array integration, the trade-off between area efficiency and computing efficiency is the central contradiction. The benefit of a crossbar implemented with two-terminal devices is that it can be stacked in 3D, which undoubtedly relaxes the requirement on area efficiency,[172] but this also demands improvements in the fabrication process.
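The IR-drop part of this problem can be sketched with a simple lumped model (our own illustration, with assumed numbers): if each of n cells along a wordline sinks roughly the same current, the wire segment nearest the driver carries n times the cell current, the next segment (n−1) times, and so on, so cells far from the driver see a visibly lower voltage.

```python
# Lumped IR-drop sketch for one crossbar line (all parameters assumed).

def voltage_profile(v_drive, n, r_wire, i_cell):
    """Voltage seen by cells 0..n-1 along a line with per-segment wire
    resistance r_wire, each cell sinking i_cell."""
    profile = []
    current_remaining = n * i_cell   # current entering the first segment
    v_node = v_drive
    for _ in range(n):
        v_node -= current_remaining * r_wire  # drop across one wire segment
        profile.append(v_node)
        current_remaining -= i_cell           # this cell's current peels off
    return profile

# With an assumed 256 cells, 1 ohm per segment, and 10 uA per cell, the far
# end of the line sees a substantially lower voltage than the near end.
profile = voltage_profile(v_drive=0.5, n=256, r_wire=1.0, i_cell=10e-6)
```

Because the total drop grows roughly with n², partitioning a large array into small tiles, as the text suggests, attacks the problem at its source.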
Methods of stacking computing and memory together utilizing two-dimensional materials or carbon nanotubes are also a potential solution to the memory-access bottleneck. By taking advantage of the fact that both the upper and lower surfaces of a two-dimensional material can serve as input channels, and adding a floating gate as the storage layer, a single transistor can realize an OR or AND logic gate, offering high area utilization while reducing the transport bottleneck.[38] Stanford's ILV (inter-layer via) 3D IC system is also moving beyond the university laboratory toward commercialization and large-scale application.[37] Unlike a TSV 3D IC, an ILV 3D IC does not stack multiple packaged chips but directly implements multiple tiers (a monolithic 3D IC) on a single wafer. Carbon nanotubes are connected to the underlying CMOS through the interlayer dielectric (ILD), enabling instant access to the results computed by the CMOS circuit. ILVs can easily reach feature sizes down to tens of nanometers, achieving interconnect densities far beyond TSV-based stacking and thus greatly improving the performance of the overall chip system. But the integration scale of these new materials is still too small for current SoCs. For carbon nanotubes and two-dimensional materials to enter the mainstream with acceptable large-scale manufacturing yield, standard cell libraries for these new devices, as well as the corresponding EDA tools and processes,[173,174] are needed; all of this is essentially a problem of design methodology and industrial ecosystems.
To further explore the practical application of in-memory computing, peripheral control circuits and system-level algorithms are the other two key issues hindering the implementation of system-level chips. The peripheral circuits of most prototype devices are based on mature CMOS technology, which is friendly to flash, SRAM, and FeFETs compatible with silicon-based processes, but not necessarily to resistive-switching devices and new-material devices made with non-traditional manufacturing processes. Because research on in-memory digital logic computing mainly remains at the device level, there are essentially no large-scale arrays at present, and system-level peripheral circuits and algorithm applications are still blank, requiring more research before practical application. Here we mainly discuss the system-level peripheral circuits and algorithms of in-memory analog computing. Most research on analog computing focuses on the integration of device arrays[25] and rarely explores the optimal design of the peripherals,[23] whether for a stand-alone device array or an array chip integrated with silicon.
Because precise programming control is needed over the storage state (usually the conductance state) of each device, the most commonly used and important peripheral circuits are digital-to-analog converters (DACs) and analog-to-digital converters (ADCs).[23–25,175,176] The results of analog calculations still need to be transmitted to other digital peripheral circuits for integration processing and then fed back into the arrays; this already accounts for about 60% of the overall system's energy consumption, demanding low-power ADCs and DACs.[177] Because the matrix-vector product, originally the most energy-hungry step of the algorithm, can now be completed in one step inside the crossbar array, the multi-step processing in the peripheral circuits appears all the more redundant, yet its share of the energy consumption remains relatively large. Owing to various random migration processes, the device units suffer from large mismatches in practical manufacturing. To compensate for these mismatches when performing multi-bit calculations, additional offset-resistant and mismatch-compensation circuits are usually required, or compensation can be performed through online learning that updates the neural-network parameters. One way to relax the requirements on the peripheral circuits is to improve the robustness of the arrays themselves. By combining high-performance crossbar arrays with a hybrid-training method, a five-layer CNN performing MNIST image recognition with a high accuracy of 96.19% was implemented on eight uniform 2048-cell memristor arrays, providing a feasible scheme for greatly improving the efficiency of CNNs (Fig.
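The one-step matrix-vector product mentioned above follows directly from Ohm's and Kirchhoff's laws: input voltages drive the rows, weights are stored as cell conductances, and each column current is the weighted sum of the inputs. A minimal numerical sketch (with arbitrary example values; real systems add DACs on the rows and ADCs on the columns):

```python
# Ideal crossbar matrix-vector product: I[j] = sum_i G[i][j] * V[i].

def crossbar_mvm(g, v):
    """Column currents for conductance matrix g (rows x cols, siemens)
    and row voltage vector v (volts)."""
    rows, cols = len(g), len(g[0])
    return [sum(g[i][j] * v[i] for i in range(rows)) for j in range(cols)]

g = [[1e-6, 2e-6],
     [3e-6, 4e-6]]            # cell conductances, siemens
v = [0.2, 0.1]                # row input voltages, volts
currents = crossbar_mvm(g, v) # column currents, amps
```

Every multiply-accumulate happens simultaneously in the analog domain; the digital loop this code emulates is exactly the work the array eliminates, which is why the remaining ADC/DAC conversions dominate the energy budget.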
If the algorithm can reduce or optimize the data-processing steps other than the matrix-vector product, the energy consumption and required area will shrink even further. The closer the hardware is to the algorithm, the better the algorithm can be applied.[19] As the most explored memory technology for in-memory computing, RRAM has been used to build many experimental demonstrations of artificial neural-network applications, from the single-layer perceptron[39,40] and sparse coding[43] to reinforcement learning[25] and convolutional neural networks[24] (Fig.
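One practical detail when mapping such networks onto arrays is that cell conductances are non-negative while network weights are signed. A common workaround, sketched below with assumed values (this is a generic scheme, not necessarily the one used in the cited demonstrations), stores each weight as a differential pair of conductances and takes the difference of two column currents.

```python
# Differential-pair weight mapping: w -> (g_plus, g_minus), output is the
# difference of the two column currents. All parameter values are assumed.

G_MAX = 1e-6  # assumed maximum programmable cell conductance, siemens

def weight_to_pair(w, w_max):
    """Map w in [-w_max, w_max] onto two non-negative conductances."""
    g = abs(w) / w_max * G_MAX
    return (g, 0.0) if w >= 0 else (0.0, g)

def pair_output(pairs, v):
    """Differential column current for one output neuron."""
    i_plus = sum(gp * vi for (gp, _), vi in zip(pairs, v))
    i_minus = sum(gm * vi for (_, gm), vi in zip(pairs, v))
    return i_plus - i_minus

weights = [0.5, -0.25, 1.0]
pairs = [weight_to_pair(w, w_max=1.0) for w in weights]
out = pair_output(pairs, v=[0.1, 0.2, 0.3])  # proportional to dot(weights, v)
```

The scheme doubles the cell count but keeps every programmed conductance physical, which is one reason hardware-aware algorithm design matters as much as the devices themselves.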
As an important driving force for the development of the future information society, the core of AI is processing huge amounts of data, which has led to a continuously growing search for new types of computing. In-memory computing, a non-von Neumann computing architecture, exhibits superior computational performance because it breaks through the limitation of the memory wall by completely eliminating the energy and time required for data transport. Hardware demonstrations based on different memory technologies have made remarkable progress (Table